On Mixup Training: Improved Calibration and Predictive Uncertainty for Deep Neural Networks
Mixup (Zhang et al., 2017) is a recently proposed method for training deep
neural networks where additional samples are generated during training by
convexly combining random pairs of images and their associated labels. While
simple to implement, it has been shown to be a surprisingly effective method of data
augmentation for image classification; DNNs trained with mixup show noticeable
gains in classification performance on a number of image classification
benchmarks. In this work, we discuss a hitherto untouched aspect of mixup
training -- the calibration and predictive uncertainty of models trained with
mixup. We find that DNNs trained with mixup are significantly better calibrated
-- i.e., the predicted softmax scores are much better indicators of the actual
likelihood of a correct prediction -- than DNNs trained in the regular fashion.
We conduct experiments on a number of image classification architectures and
datasets -- including large-scale datasets like ImageNet -- and find this to be
the case. Additionally, we find that merely mixing features does not result in
the same calibration benefit and that the label smoothing in mixup training
plays a significant role in improving calibration. Finally, we also observe
that mixup-trained DNNs are less prone to over-confident predictions on
out-of-distribution and random-noise data. We conclude that the typical
overconfidence seen in neural networks, even on in-distribution data, is likely
a consequence of training with hard labels, suggesting that mixup training be
employed for classification tasks where predictive uncertainty is a significant
concern.
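As a rough illustration of the convex combination described in the abstract above, the following sketch mixes random pairs of examples together with their one-hot labels. The function names, the toy data layout (plain lists of floats), and the Beta(alpha, alpha) draw for the mixing weight are assumptions made for this sketch; the Beta draw follows the common formulation of mixup and is not stated in the abstract. This is not the authors' implementation.

```python
import random


def mixup_pair(x_a, y_a, x_b, y_b, lam):
    """Convexly combine two examples and their one-hot labels with weight lam."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


def mixup_batch(inputs, labels, alpha=0.4, rng=None):
    """Mix every example in a batch with a randomly chosen partner.

    alpha is a hypothetical default; lam ~ Beta(alpha, alpha) is an assumption.
    """
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)  # one mixing weight shared by the batch
    partners = list(range(len(inputs)))
    rng.shuffle(partners)                # random pairing within the batch
    mixed = [
        mixup_pair(inputs[i], labels[i], inputs[j], labels[j], lam)
        for i, j in enumerate(partners)
    ]
    xs = [m[0] for m in mixed]
    ys = [m[1] for m in mixed]
    return xs, ys, lam
```

Because the labels are mixed with the same weight as the inputs, each target stays a valid probability vector but is no longer one-hot, which is the soft-label effect the abstract credits for the calibration benefit.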
Federated Representation Learning for Automatic Speech Recognition
Federated Learning (FL) is a privacy-preserving paradigm, allowing edge
devices to learn collaboratively without sharing data. Edge devices like Alexa
and Siri are prospective sources of unlabeled audio data that can be tapped to
learn robust audio representations. In this work, we bring Self-supervised
Learning (SSL) and FL together to learn representations for Automatic Speech
Recognition respecting data privacy constraints. We use the speaker and chapter
information in the unlabeled speech dataset, Libri-Light, to simulate non-IID
speaker-siloed data distributions and pre-train an LSTM encoder with the
Contrastive Predictive Coding framework with FedSGD. We show that the
pre-trained ASR encoder in FL performs as well as a centrally pre-trained model
and produces an improvement of 12-15% (WER) compared to no pre-training. We
further adapt the federated pre-trained models to a new language, French, and
show a 20% (WER) improvement over no pre-training.
Comment: Accepted at ISCA SPSC Symposium 3rd Symposium on Security and Privacy
in Speech Communication, 202
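The FedSGD aggregation mentioned in the abstract above can be sketched on a toy problem: each client computes a gradient on its own data, and the server averages those gradients (weighted by client size) before taking a single step. This sketch assumes a one-parameter least-squares model in place of the LSTM encoder and Contrastive Predictive Coding loss, and all names and the learning rate are hypothetical.

```python
def local_gradient(w, data):
    """Gradient of mean squared error for the toy model y = w * x on one client."""
    return sum(2.0 * (w * x - y) * x for x, y in data) / len(data)


def fedsgd_round(w, clients, lr=0.05):
    """One FedSGD round: size-weighted average of client gradients, then one step.

    Only gradients leave the clients; the raw (x, y) pairs stay local.
    """
    total = sum(len(data) for data in clients)
    grad = sum(len(data) * local_gradient(w, data) for data in clients) / total
    return w - lr * grad


def train(clients, rounds=100, w=0.0, lr=0.05):
    """Run several FedSGD rounds starting from w."""
    for _ in range(rounds):
        w = fedsgd_round(w, clients, lr=lr)
    return w
```

With size-weighted averaging, the aggregated gradient equals the gradient computed on the pooled data, which is one way to see why a federated model can match a centrally trained one in this idealized setting.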
BB-ML: Basic Block Performance Prediction using Machine Learning Techniques
Recent years have seen the adoption of Machine Learning (ML) techniques to
predict the performance of large-scale applications, mostly at a coarse level.
In contrast, we propose to use ML techniques for performance prediction at a
much finer granularity, namely at the level of Basic Blocks (BBs): single-entry,
single-exit code blocks that compilers use in analysis to break large programs
into manageable pieces. We extrapolate the basic block
execution counts of GPU applications and use them for predicting the
performance for large input sizes from the counts of smaller input sizes. We
train a Poisson Neural Network (PNN) model using random input values as well as
the lowest input values of the application to learn the relationship between
inputs and basic block counts. Experimental results show that the model can
accurately predict the basic block execution counts of 16 GPU benchmarks. We
achieve an accuracy of 93.5% in extrapolating the basic block counts for large
input sets when trained on smaller input sets and an accuracy of 97.7% in
predicting basic block counts on random instances. In a case study, we apply
the ML model to CUDA GPU benchmarks for performance prediction across a
spectrum of applications. We use a variety of metrics for evaluation, including
global memory requests and the active cycles of tensor cores, ALU, and FMA
units. Results demonstrate the model's capability of predicting the performance
of large datasets with average error rates of 0.85% and 0.17% for global and
shared memory requests, respectively. Additionally, to address the utilization
of the main functional units in Ampere architecture GPUs, we calculate the
active cycles for tensor cores, ALU, FMA, and FP64 units and achieve an average
error of 2.3% and 10.66% for ALU and FMA units, respectively, while the maximum observed error
across all tested applications and units reaches 18.5%.
Comment: Accepted at the 29th IEEE International Conference on Parallel and
Distributed Systems (ICPADS 2023
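The idea of learning execution counts from small inputs and extrapolating to larger ones can be illustrated with a Poisson objective. The abstract does not describe the internals of the Poisson Neural Network, so this sketch substitutes a one-feature log-linear Poisson regression fitted by gradient descent; the function names, learning rate, step count, and toy counts are all assumptions.

```python
import math


def poisson_nll(log_rate, count):
    """Negative log-likelihood of one count under a Poisson with rate exp(log_rate)."""
    return math.exp(log_rate) - count * log_rate + math.lgamma(count + 1.0)


def fit_poisson(xs, counts, lr=0.01, steps=5000):
    """Fit log(rate) = a * x + b by gradient descent on the mean Poisson NLL."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # d(NLL)/d(log_rate) = rate - count; the chain rule gives both gradients.
        residuals = [math.exp(a * x + b) - c for x, c in zip(xs, counts)]
        grad_a = sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = sum(residuals) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b


def predict_count(a, b, x):
    """Expected count for input x under the fitted model."""
    return math.exp(a * x + b)
```

For example, fitting on hypothetical counts that double with each input size (`xs = [0, 1, 2, 3]`, `counts = [1, 2, 4, 8]`) and then calling `predict_count` at `x = 5` extrapolates beyond the training range, which mirrors the small-to-large input setting the abstract describes.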
ILASR: Privacy-Preserving Incremental Learning for Automatic Speech Recognition at Production Scale
Incremental learning is one paradigm to enable model building and updating at
scale with streaming data. For end-to-end automatic speech recognition (ASR)
tasks, the absence of human annotated labels along with the need for privacy
preserving policies for model building makes it a daunting challenge. Motivated
by these challenges, in this paper we use a cloud based framework for
production systems to demonstrate insights from privacy preserving incremental
learning for automatic speech recognition (ILASR). By privacy preserving, we
mean the use of ephemeral data that are not human annotated. This system is a
step forward for production-level ASR models for incremental/continual learning
that offers a near real-time test bed for experimentation in the cloud for
end-to-end ASR, while adhering to privacy-preserving policies. We show that the
proposed system can improve the production models significantly (3%) over a new
time period of six months even in the absence of human annotated labels with
varying levels of weak supervision and large batch sizes in incremental
learning. This improvement is 20% over test sets with new words and phrases in
the new time period. We demonstrate the effectiveness of model building in a
privacy-preserving incremental fashion for ASR while further exploring the
utility of having an effective teacher model and use of large batch sizes.
Comment: 9 page